Sometimes It Pays to Lie on the Internet: The Semantic Web a

Mark Leighton Fisher on 2006-03-24T17:37:06

"Sometimes It Pays to Lie on the Internet." It sounds obvious, yet it holds the mystery as to why the Semantic Web technologies have not taken over the Web. So why hasn't the Semantic Web taken over the Net? These technologies have the ability to make searching much faster and more efficient. For example, I found that at Dublin Core metadata tags to pages made the correct pages turn up as the first few search hits.

Briefly, metadata is the information about a page – creation date, author's name, etc. Metadata can be included in the header (HEAD section) of a page so that it can be found easily. For many purposes, the interesting piece of metadata is the list of keywords/tags that help to classify the page into one or more categories. These keywords (or tags) are where the Semantic Web draws much of its power from, but this power can be poisoned to serve the desires of the deceitful to the point where it does *not* serve those who simply want to find something on the Internet.
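To make the HEAD-section metadata concrete, here is a minimal sketch in Python that pulls name/content pairs out of a page's meta tags using the standard library's html.parser. The sample page, the class name MetaTagParser, and the helper extract_metadata are illustrative assumptions, not anything from a real site or tool; the DC.creator and DC.date names follow the common Dublin Core convention for HTML meta tags.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            # A name may legitimately repeat, so keep a list per name.
            self.metadata.setdefault(name.lower(), []).append(content)


def extract_metadata(html_text):
    """Return a dict mapping lowercased meta names to lists of values."""
    parser = MetaTagParser()
    parser.feed(html_text)
    return parser.metadata


# Made-up page for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Example page</title>
<meta name="DC.creator" content="A. N. Author">
<meta name="DC.date" content="2006-03-24">
<meta name="keywords" content="semantic web, metadata, search">
</head><body><p>Body text.</p></body></html>"""

if __name__ == "__main__":
    for name, values in extract_metadata(SAMPLE_PAGE).items():
        print(name, values)
```

Note that nothing in this step checks whether the declared author, date, or keywords are true. The parser simply reports whatever the page says about itself.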

The problem can be viewed as a Systems Engineering problem -- where are the feedback loops, what feedback loops are missing, and are the existing feedback loops correctly configured? So where (and what) are the feedback loops on the Net? In the words of Deep Throat in the movie "All the President's Men," "Follow the money." When you are engaged in a commercial enterprise – as are many of the entities on the Internet – a very strong (and often the dominant) feedback loop is the money/power/fame feedback loop.

Inside healthy organizations, there is a premium placed on accurate and complete communication. ("Healthy" in the well-functioning sense -- a department within the KGB could have been said to have been healthy.) Organizations thrive on accurate information – and starve on inaccurate information – so much so that the flow of accurate information becomes a dominating feedback loop. Because accurate information is a major means to continued health, the accurate-information positive feedback loop is also a dominating money/power/fame positive feedback loop inside organizations. The recent case of Enron is interesting in this regard, as an organizationally healthier Enron would have meant more employees knowing the extent to which Enron was bending or breaking the law.

The keyword/tagging schemes used by Semantic Web technologies range from the simple and user-specific to sophisticated keywording schemes worthy of the U.S. Library of Congress (or at least folksonomies like those of Flickr et al.). Semantic Web technologies then use this metadata to draw conclusions about the pages it describes. But bad metadata leads to bad conclusions (GIGO – Garbage In, Garbage Out). Why would anyone deliberately put bad metadata on their pages? Because it pays – sometimes it pays very well.

Why does it pay to put bad metadata on webpages? Because if you can lure a bunch of people to your pages through deceptive metadata, you may be able to turn some of those people into consumers of your products/services. This is the same proposition used by spammers – throw around enough of it and some of it is bound to stick. Throw enough of the bad/deceptive metadata on your pages, and some of it will be sticky enough to capture additional customers. Before Google, it had gotten to the point where you had cause to wonder how many of your search engine hits were going to be for items entirely unrelated to the subject of your search.
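A toy illustration of why this works, assuming a deliberately naive search engine that trusts each page's self-declared keywords. Everything here (the function names keyword_score and naive_search, the example URLs, the keyword lists) is made up for the sketch and does not describe how any real search engine ranks pages.

```python
def keyword_score(query, page):
    """Count how many query terms appear in the page's self-declared keywords."""
    declared = {kw.lower() for kw in page["keywords"]}
    return sum(1 for term in query.lower().split() if term in declared)


def naive_search(query, pages):
    """Rank pages purely by their own keywords; ties keep input order."""
    scored = [(keyword_score(query, p), p["url"]) for p in pages]
    hits = [(score, url) for score, url in scored if score > 0]
    # sorted() is stable, so equal scores stay in their original order.
    return [url for score, url in sorted(hits, key=lambda h: -h[0])]


# Hypothetical pages: an honest one, and one that stuffs its keyword list
# with popular terms that have nothing to do with what it sells.
PAGES = [
    {"url": "honest-review.example", "keywords": ["movie", "review"]},
    {
        "url": "pill-shop.example",
        "keywords": ["movie", "review", "trailer", "tickets", "showtimes"],
    },
]

if __name__ == "__main__":
    print(naive_search("movie review trailer", PAGES))
```

Because the ranking depends only on what each page claims about itself, the page that claims the most wins, and the liar pays nothing for the lie.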

So how does this affect non-commercial uses of the Internet? IMHO, the closer the use is to a business use, the worse is the problem of bad metadata. For example, the high-energy physics community does not see much of a problem, while movie fans run into the bad metadata problem all the time. (When GE starts marketing the Home Synchrotron 9000 with optional electron sorter, then the high-energy physics people will have something to worry about.)

To market or sell on the Internet or elsewhere, ofttimes one needs only to be persuasive – not truthful. Truthful information, however, is vital for Semantic Web technologies – without (enough) truthful information, the Semantic Web technologies (still in their infancy on the Internet) will undergo "failure to thrive."

Will we see widespread adoption of the Semantic Web technologies? Sure. Where will we see them? Inside Intranets and between co-operating groups on the Net. It is a failure of human character, not of the Semantic Web technologies, that they will see only marginal use on the major portion of the public Internet. Whether you view it as a hopefully temporary lack of character on the part of the general public, or as more evidence of Original Sin, the fact remains that many members of the Internet's population will not always provide you with accurate information, especially when they stand to gain by deceiving you. Therein lies the problem – when technology comes into conflict with human nature, human nature wins every time. I am looking forward to the widespread adoption of Semantic Web technologies, but I have little expectation that those self-same technologies will prove to be a panacea for general Web searches.

It comes down to a matter of trust. Semantic Web technologies were designed to function in a trusting environment – academics don't usually expect that other academics will deliberately lie to them on technical matters. (Who is favored to become Department Chair is another matter.) The Public Internet, on the other hand, includes those who would deceive and delude the rest of us if they thought they stood to gain by their falsehoods. Semantic Web technologies will need to engage both positive and negative feedback loops (positive for accurate information and negative for inaccurate information) so that these technologies only perform accurate reasoning about the information on the Web – and what are search hits but the results of reasoning about whether certain webpages are relevant to our search request?

Google's PageRank algorithm, and others of that ilk, will continue to help the most with Internet searches for the near future, as they harness the power of public opinion in the form of positive and negative feedback loops – the false publicity of botnet-pumped link counts is not enough to outweigh the mass of known-good links by the multitudes of real people on the Internet. Only by harnessing something like Google's PageRank can the Semantic Web technologies become a major force in searching through – and reasoning about – all of the World Wide Web.
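For readers who have not seen link-based ranking spelled out, here is a minimal sketch of the textbook PageRank power iteration in Python. It is a simplified illustration, not Google's production algorithm; the damping factor of 0.85 is the commonly cited default, and the tiny link graph (blog-a, good-page, bot-1, spam-page, and so on) is entirely made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    Returns a dict mapping page -> rank; the ranks sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: three independent blogs link to good-page, while
# spam-page is propped up only by a single bot page it links back to.
LINKS = {
    "blog-a": ["good-page"],
    "blog-b": ["good-page", "blog-a"],
    "blog-c": ["good-page"],
    "good-page": ["blog-a"],
    "bot-1": ["spam-page"],
    "spam-page": ["bot-1"],
}

if __name__ == "__main__":
    ranks = pagerank(LINKS)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(f"{page}: {ranks[page]:.3f}")
```

In this toy graph good-page comes out ahead of spam-page because its rank flows in from several independent pages rather than from one page that exists only to promote it. That is only a small-scale illustration of the feedback-loop idea; it does not show that link spam can never succeed.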


high-energy physics and movies

mr_bean on 2006-03-26T10:54:45

I think the message is get involved in physics or
something else not mass-oriented and stay away
from the movies. And when GE markets the
Cyclotron move on to something else which is
free.
Seriously though, I agree the social aspects of
technology need as much thought as the technical
ones.